Nature Medicine

Springer Science and Business Media LLC

Preprints posted in the last 7 days, ranked by how well they match Nature Medicine's content profile, based on 117 papers previously published here. The average preprint has a 0.16% match score for this journal, so anything above that is already an above-average fit.

1
Human vs AI Clinical Assessment: Benchmarking a Multimodal Foundation Model Against Multi-Center Expert Judgment on the Mental Status Examination.

Mwangi, B.; Jabbar Abdl Sattar Hamoudi, H.; Sanches, M.; Dogan, N.; Chaudhary, P.; Wu, M.-J.; Zunta-Soares, G. B.; Soares, J. C.; Martin, A.; Soutullo, C. A.

2026-04-20 psychiatry and clinical psychology 10.64898/2026.04.17.26351105 medRxiv
Top 0.1%
52.1%

The Mental Status Examination (MSE) is the cornerstone of the psychiatric evaluation, yet validating artificial intelligence (AI) against the inherent variance of clinical judgment remains a critical bottleneck. Here we introduce a multi-center framework to benchmark the open-weight multimodal foundation model Qwen3-Omni against independent expert panels at two sites, UTHealth and Yale. Evaluating 396 classifications across 10 MSE domains and three longitudinal timepoints of increasing symptom severity, we found that experts achieved substantial agreement (Gwet's AC1 = 0.87), whereas the model achieved only moderate alignment (AC1 = 0.70-0.72). Even as the model's overall pathology prediction rate approximated the experts', the aggregate equilibrium masked a profound "clinical reasoning gap". Specifically, the model systematically over-predicted observable signs (e.g., speech, affect) while notably failing in inferential domains requiring the interpretation of latent mental content (e.g., delusions, perceptions). A 4-bit quantization analysis of the model confirmed this mechanistically: reducing model capacity disproportionately degraded inferential reasoning while preserving perceptual feature extraction. Furthermore, model-to-expert agreement degraded linearly as clinical complexity intensified across longitudinal visits (Accuracy: T0 = 84.8-87%; T1 = 80-82%; T2 = 71-73%), whereas expert consensus remained robust. Notably, model errors increased 2.3- to 3.4-fold where human experts disagreed. These findings establish inter-expert variance as an essential measurable baseline for psychiatric AI, demonstrating that true clinical translation requires models to move beyond multimodal perceptual extraction to achieve higher-order diagnostic reasoning.
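
For readers unfamiliar with the agreement statistic quoted in this abstract, the following is a minimal sketch of Gwet's AC1 for two raters and nominal categories, using the standard definition AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and p_e = (1 / (q - 1)) * sum_k pi_k * (1 - pi_k), with pi_k the average of the two raters' marginal proportions for category k. The function name `gwet_ac1` and the toy ratings are illustrative assumptions; they are not the preprint's code or data.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters scoring the same items on a nominal scale.

    `categories` is the full set of possible labels; if omitted, the labels
    actually observed are used (an assumption made for this sketch).
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    if categories is None:
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        return 1.0  # only one label in use: the raters agree trivially

    # Observed agreement: share of items given the same label by both raters.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # pi_k: mean of the two raters' marginal proportions for category k.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    pis = [(counts_a[c] + counts_b[c]) / (2 * n) for c in categories]

    # Chance agreement under Gwet's model.
    p_e = sum(pi * (1 - pi) for pi in pis) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Toy example: 1 = pathology present, 0 = absent, for ten hypothetical ratings.
expert = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(round(gwet_ac1(expert, model), 2))  # 0.84
```

In the toy example the raters agree on 9 of 10 items (p_a = 0.9) and the chance term is p_e = 0.375, giving AC1 = 0.84. Unlike Cohen's kappa, the chance term shrinks when one category dominates, which is one reason AC1 is often preferred for skewed clinical ratings.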

2
Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta Analysis

Zhu, L.; Wang, W.; Liang, Z.; Tan, W.; Chen, B.; Lin, X.; Wu, Z.; Yu, H.; Li, X.; Jiao, J.; He, S.; Dai, G.; Niu, J.; Zhong, Y.; Hua, W.; Chan, N. Y.; Lu, L.; Wing, Y. K.; Ma, X.; Fan, L.

2026-04-22 psychiatry and clinical psychology 10.64898/2026.04.21.26351365 medRxiv
Top 0.1%
27.4%

The rapid rise of large language models (LLMs) and foundation models has accelerated efforts to build artificial intelligence (AI) agents for mental health assessment, triage, psychotherapy support and clinical decision assistance. Yet a gap persists between healthcare and AI-focused work: while both communities use the language of "agents," clinical research largely describes monolithic chatbots, whereas AI studies emphasize agentic properties such as autonomous planning, multiagent coordination, tool and database use and integration with multimodal mental health data streams. In this Review, we conduct a systematic analysis of mental health AI agent systems from 2023 to 2025 using a six-dimensional audit framework: (i) system type (base model lineage, interface modality and workflow composition, from rule-based tools to role-aware multi-agent foundation-model systems), (ii) data scope (modalities and provenance, from elicited self-report and chatbot dialogues to electronic health records, biosensing and synthetic corpora), (iii) mental health focus (mapped to ICD-11 diagnostic groupings), (iv) demographics (age strata, geography and sex representation), (v) downstream tasks (screening/triage, clinical decision support, therapeutic interventions, documentation, ethical-legal support and education/simulation) and (vi) evaluation types (automated metrics, language quality benchmarks, safety stress tests, expert review and clinician or patient involvement). 
Across this corpus, we find that most systems (1) concentrate on depression, anxiety and suicidality, with sparse coverage of severe mental illness, neurocognitive disorders, substance use and complex comorbidity; (2) rely heavily on text-based self-report rather than clinically verified longitudinal data or genuinely multimodal inputs; (3) are implemented as single-agent chatbots powered by general-purpose LLMs rather than role-structured, workflow-integrated pipelines; and (4) are evaluated primarily via offline metrics or vignette-based scenarios, with few prospective, clinician- or patient-in-the-loop studies. At the same time, an emerging class of agentic systems assigns foundation models explicit roles as planners, retrieval agents, safety auditors or supervisors coordinating other models and tools. These multiagent, tool-augmented workflows promise personalization, safety monitoring and greater transparency, but they also introduce new risks around reliability, bias amplification, privacy, regulatory accountability and the blurring of clinical versus non-clinical roles. We conclude by outlining priorities for the next generation of mental health AI agents: clinically grounded, role-aware multi-agent architectures; transparent and privacy-preserving use of clinical and elicited data; demographic and cultural broadening beyond predominantly Western adult samples; and evaluation pipelines that progress from offline benchmarks to longitudinal, real-world studies with routine safety auditing and clear governance of responsibilities between agents and human clinicians.

3
Peer support boosted Hepatitis C treatment access among marginalised populations in England: A Bayesian causal factor analysis.

Schmidt, C.; Samartsidis, P.; Seaman, S.; Emmanouil, B.; Foster, G.; Reid, L.; Smith, S.; De Angelis, D.

2026-04-22 health policy 10.64898/2026.04.20.26351261 medRxiv
Top 0.1%
18.6%

To minimise health disparities, equitable access to medical treatment is paramount. In a pioneering intervention, National Health Service England's Hepatitis C virus (HCV) programme has implemented country-wide peer support to boost treatment access. Peer support workers (peers) are individuals with relevant lived experience, who promote testing and treatment in marginalised populations underserved by traditional health services. We evaluated the English peers intervention, exploiting its staggered rollout and rich surveillance data between June 2016 and May 2021. Peers increased HCV cases identified by 13·9% (95% credible interval (95% CrI) [5·3, 21·7]), sustained viral responses by 8·0% (95% CrI [-4·4, 18·6]), and drug services referrals by 8·8% (95% CrI [-12·5, 22·6]). The intervention's effectiveness was magnified during the first COVID-19 lockdown and individuals supported by peers typically belonged to populations with poor treatment access. Our findings indicate that peers can boost equity in treatment access on a national scale.

4
Identify Patients at Risk of HIV Using a Clinical Large Language Model from Electronic Health Records

Liu, Y.; Chen, Z.; Suman, P.; Cho, H.; Prosperi, M.; Wu, Y.

2026-04-23 hiv aids 10.64898/2026.04.21.26351427 medRxiv
Top 0.1%
12.6%

This study developed a large language model (LLM)-based solution to identify people at HIV risk using electronic health records. We transformed structured EHR data, including demographics, diagnoses, and medications, into narrative descriptions ordered by visit date and applied GatorTron, a widely used clinical LLM trained on 82 billion words of de-identified clinical text. We compared GatorTron with traditional machine learning models, including LASSO and XGBoost. We identified a cohort with 54,265 individuals, where only 3,342 (6%) had new HIV diagnoses. Our LLM solution, based on GatorTron, achieved excellent performance, reaching an F1 score of 53.5% and an AUC of 0.88, comparable to traditional machine learning approaches. Subgroup analysis showed that, across age, sex, and race/ethnicity groups, both LLM and traditional models achieved AUCs above 0.82. Interpretability analyses showed broadly consistent patterns across LLM models and traditional machine learning models.

5
Plasma proteomics link menopause timing to brain aging and dementia risk

Wood Alexander, M.; Wood, B.; Oh, H. S.-H.; Bot, V. A.; Borger, J.; Galbiati, F.; Walker, K. A.; Resnick, S. M.; Ochs-Balcom, H. M.; Wyss-Coray, T.; Kooperberg, C.; Reiner, A. P.; Jacobs, E. G.; Rabin, J. S.; Casaletto, K. B.; Saloner, R.

2026-04-24 neurology 10.64898/2026.04.23.26351500 medRxiv
Top 0.1%
12.3%

Earlier menopause is a risk factor for several age-related diseases, including dementia. The biological pathways linking menopause timing to later-life brain aging are not understood. Leveraging large-scale plasma proteomics in postmenopausal women from the UK Biobank (N=15,012), earlier menopause was associated with upregulation of pro-inflammatory and extracellular matrix degradation pathways, plus accelerated aging across proteomic clocks of organ and cellular aging, including brain and oligodendrocyte aging. Elevated GDF15, a canonical aging marker, was the top protein correlate of earlier menopause. We observed robust replication of menopause timing proteomic shifts in the Women's Health Initiative Long Life Study (N=1,210). In UKB, proteins associated with earlier menopause, including GDF15, exhibited concordant associations with incident dementia risk and brain atrophy, cerebral small vessel disease burden, and white matter microstructural integrity. Collectively, our findings identify proteomic signatures linking ovarian aging to brain aging, providing a framework to inform interventions to reduce dementia risk.

6
Multi-BOUNTI: Multi-lobe Brain vOlUmetry and segmeNtation for feTal and neonatal MRI

Uus, A.; Fukami-Gartner, A.; Kyriakopoulou, V.; Cromb, D.; Morgan, T.; Arulkumaran, S.; Egloff Collado, A.; Luis, A.; Bos, R.; Makropoulos, A.; Schuh, A.; Robinson, E.; Sousa, H.; Deprez, M.; Cordero-Grande, L.; Bradshaw, C.; Colford, K.; Hutter, J.; Price, A.; O'Muircheartaigh, J.; Hammers, A.; Rueckert, D.; Counsell, S.; McAlonan, G.; Arichi, T.; Edwards, A. D.; Hajnal, J. V.; Rutherford, M. A.; Story, L.

2026-04-22 pediatrics 10.64898/2026.04.21.26351376 medRxiv
Top 0.1%
12.2%

Regional volumetric assessment of perinatal brain development is currently limited by the lack of consistent high quality multi-regional segmentation methods applicable to both fetal and neonatal MRI. We present Multi-BOUNTI, a deep learning pipeline for automated multi-lobe segmentation of fetal and neonatal T2w brain MRI. The method is based on a dedicated 43-label parcellation protocol and a 3D Attention U-Net trained on brain MRI datasets of subjects spanning 21-44 weeks gestational/postmenstrual age. The pipeline integrates preprocessing, segmentation and volumetric analysis, and was evaluated on independent datasets, demonstrating fast (< 10 min/case) and accurate performance with high agreement to manually refined labels. We demonstrate the application of the framework with 267 fetal and 593 neonatal MRI datasets from the developing Human Connectome Project without reported clinically significant brain anomalies to derive normative volumetric growth models across 21-44 weeks GA/PMA. These models were used to characterise developmental trajectories, assess differences between fetal and preterm neonatal cohorts, and analyse longitudinal changes. The resulting normative models were integrated into an automated reporting framework enabling subject-specific volumetric assessment via centiles and z-scores. Multi-BOUNTI provides a unified and scalable approach for perinatal brain segmentation and volumetry, supporting large-scale studies and facilitating future clinical translation. The full pipeline is publicly available at https://github.com/SVRTK/perinatal-brain-mri-analysis.

7
Legacy neuropsychiatric benefit after semaglutide is linked to maximum achieved dose and independent of the maximum weight lost

Murugadoss, K.; Venkatakrishnan, A.; Soundararajan, V.

2026-04-23 endocrinology 10.64898/2026.04.16.26351060 medRxiv
Top 0.1%
9.9%

GLP-1 receptor agonists have reshaped obesity therapeutics, but their impact on neuropsychiatric outcomes remains poorly characterized. From 29 million patients in a large federated data platform across the USA, including 489,785 semaglutide treated patients, we conducted an observational study integrating longitudinal neuropsychiatric outcomes. From this population, we assembled a cohort of 63,215 patients with baseline neuropsychiatric conditions before treatment initiation and evaluated 24 incident neuropsychiatric outcomes. In propensity-matched comparator analyses, during the 2 year time-period from treatment initiation, semaglutide was associated with broadly lower neuropsychiatric event risk than metformin, SGLT2 inhibitors, and DPP-4 inhibitors. Within the semaglutide-treated cohort, higher attained dose during the first two years after the first prescription ("pre-landmark period") was associated with significantly lower incidence during the following two years ("post-landmark period") of diagnostic codes associated with substance-related disorders (P<0.001), mood disorders (P<0.001), anxiety- and stress-related disorders (P<0.001), CNS atrophies (P<0.001), neuromuscular disorders (P=0.013), eating/sleep/behavioral disorders (P=0.022), and personality/impulse-control disorders (P=0.028). Consistent with previous clinical trials, the post-landmark incidence of dementia or CNS degenerative diseases was similar between the high-dose and low-dose semaglutide cohorts (P=0.15). For most neuropsychiatric diagnoses, post-landmark incidence was strongly associated with the maximum attained semaglutide dose during the pre-landmark period, but incident cognitive symptoms and speech/language symptoms were more closely linked to the pre-landmark weight-loss magnitude (p<0.001 and p<0.003, respectively). 
Bulk and single-cell transcriptomic analyses demonstrated GLP1R expression in CNS tissues (hypothalamus, caudate, putamen, nucleus accumbens, cerebellum) and peripheral nerves. Age-associated heterogeneity in GLP1R expression was evident in several of these compartments including the caudate nucleus, suggesting dynamic changes in the availability of the neurobiological substrate for semaglutide response. Together, these data support a model in which semaglutide confers a sustained, dose-dependent, weight loss-independent benefit across multiple neuropsychiatric conditions via direct CNS target engagement. This observational study motivates prospective clinical studies and mechanistic analyses to clarify the impact of GLP-1 receptor agonists on human neuropsychiatric pathways and disease processes.

8
Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases

Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.

2026-04-23 neurology 10.64898/2026.04.22.26351488 medRxiv
Top 0.1%
8.7%

Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple Sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management. Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures. Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses for more than 91% of cases. However, diagnostic competence did not associate with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2%, 95% CI 5.6 to 8.8; Pro: 15.8%, 95% CI 13.6 to 18.1) compared to GPT 5 mini (23.5%, 95% CI 20.8 to 26.1), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT 5.2; 6.4% GPT 5 mini) compared to below 1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. 
Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old. Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.

9
Rapid functional classification of cardiac genetic variants directly informs precision cardiology

Wang, X.; Chen, P.-T.; Mayourian, J.; Ripple, L.; Tharani, Y.; Shang, T.; Pavlaki, N.; Shani, K.; Jang, Y.; Janson, C.; Mah, D.; Parker, K. K.; Pu, W. T.; Ha, T.; Bezzerides, V.

2026-04-19 bioengineering 10.64898/2026.04.15.718512 medRxiv
Top 0.2%
8.1%

Large-scale clinical genome sequencing yields vast numbers of variants of unknown significance (VUSs). The high frequency of VUSs and the paucity of platforms to characterize their functional impact pose significant challenges for clinical decision making. Here, we present an integrated end-to-end platform, REVi-SCOPE (Rapid evaluation of variants in single cells by optogenetics and prime editing), for characterization of the impact of VUSs on cardiac physiology. Our strategy consists of (1) introduction of variants directly into wild-type (WT) human induced pluripotent stem cell-derived cardiomyocytes (hiPSC-CMs) via prime editing; (2) optogenetic assessment of calcium and membrane voltage dynamics in single hiPSC-CMs within the pool of edited and unedited cells; and (3) in situ single-cell genotyping of the phenotyped hiPSC-CMs with single-allele resolution. By optimizing and integrating each of these steps, we created a platform that enables VUS characterization in 10 days. We validated REVi-SCOPE's capabilities by analyzing the properties of established arrhythmogenic variants. We then used REVi-SCOPE to reveal the functional impact of a VUS, TRPM4 A320V, identified in a child with a conduction block. Together, our results show that REVi-SCOPE enables functional characterization of VUSs linked to cardiac arrhythmias with unprecedented throughput.

10
From Protocol to Practice: Graded Sepsis Bundle Compliance and Actionable Insights from Real-World ICU Data

Tripathi, H.; Roy, K.; Rahimi, S.; Neupane, S.; Bozorgzad, S.

2026-04-25 intensive care and critical care medicine 10.64898/2026.04.23.26351412 medRxiv
Top 0.2%
7.3%

Sepsis is a leading cause of in-hospital mortality, yet systematically evaluating temporal adherence to the Surviving Sepsis Campaign (SSC) bundle across large patient populations remains difficult due to semantic variability in electronic health records and the loss of clinical nuance inherent in binary pass/fail compliance judgments. We present an expert-guided neuro-symbolic pipeline that pairs LLM-based semantic normalization with a Sugeno fuzzy inference system encoding eight SSC bundle rules, producing graded per-episode compliance scores whose clinical decision boundaries are set through domain expert consultation. Applied to 2,438 sepsis episodes from MIMIC-IV v3.1, the dual-classifier normalization layer achieves substantial inter-system agreement with high embedding-based confirmation, resolving hundreds of clinically relevant drug strings that purely symbolic systems miss. The graded framework reveals that Hour-1 bundle failures, particularly antibiotic timing, are the dominant driver of low overall compliance, and that higher bundle adherence is associated with notably shorter ICU stays, with antibiotic delays beyond six hours increasing median stays by 61%. These results demonstrate that neuro-symbolic graded assessment can surface actionable compliance patterns that binary evaluation frameworks cannot capture.

11
Where risk becomes visible: a layered fixed-policy framework for diabetic kidney disease screening in type 2 diabetes

Khattab, A.; Wang, Z.; Srinivasasainagendra, V.; Tiwari, H. K.; Loos, R.; Limdi, N.; Irvin, M. R.

2026-04-22 nephrology 10.64898/2026.04.21.26351384 medRxiv
Top 0.2%
7.3%

Background: Diabetic kidney disease (DKD) is a leading cause of kidney failure in individuals with type 2 diabetes (T2D), yet risk identification in routine clinical practice remains incomplete. A critical and often overlooked barrier is risk observability: how much of a patient's underlying risk is actually captured in their clinical record at the time of screening. Existing prediction models evaluate performance using model-specific thresholds, making it difficult to understand how additional data sources alter real-world screening behavior or which individuals benefit when models are expanded. Methods: We developed a series of five nested machine learning models evaluated at a one-year landmark following T2D diagnosis using data from the All of Us Research Program (N = 39,431; cases = 16,193). Each successive model added a distinct information layer -- intrinsic risk, laboratory snapshots, medication exposure, longitudinal care trajectories, and social determinants of health (SDOH) -- while retaining all prior features. All models were evaluated under a fixed screening policy targeting 90% specificity, so that the false positive rate remained constant as the information available to the model grew. External validation was conducted in the BioMe Biobank (N = 9,818) without retraining. Results: Discrimination improved consistently across layers, from AUROC 0.673 (M1) to 0.797 (M5). Under the fixed screening policy, sensitivity nearly doubled from 0.27 to 0.49, with a cumulative recovery of 30.4% of cases missed by the base model. 
Gains were driven by distinct subgroups at each transition: laboratory features identified biologically high-risk individuals; medication features captured those with high treatment intensity reflecting advanced cardiometabolic burden; longitudinal care trajectory features rescued cases with biological instability observable only through repeated measurements; and SDOH features recovered individuals with limited clinical observability, with rescue probability highest among those with the fewest recorded monitoring domains. Sparse data in the clinical record indicated low observability, not low risk. Social and genetic features each contributed most when downstream physiologic signal was limited, supporting a contextual rather than universal role for each. In BioMe, discrimination was attenuated (M4 AUROC 0.659), but the relative ordering of information layers was fully preserved, and a systematic upward shift in predicted probability distributions underscored the need for recalibration before deployment in a new setting. Conclusions: DKD risk detection in T2D is substantially improved by integrating complementary information layers under a fixed clinical screening policy, with gains arising from distinct domains that identify at-risk individuals in different clinical contexts. The layered landmark framework introduced here reveals how risk observability -- shaped by monitoring intensity, healthcare engagement, and access -- determines what a screening model can detect, and provides a foundation for context-aware EHR-based screening that accounts for data availability at the time of risk assessment. 
Graphical abstract. Study design and layered DKD screening framework. The top row defines the cohort timeline, in which predictors are derived from clinical data collected between T2D diagnosis and the 1-year landmark, and incident DKD is ascertained after the landmark. The second row depicts the nested model architecture, in which five successive models sequentially incorporate intrinsic risk, laboratory snapshot features, medication exposure, longitudinal care trajectories, and social determinants of health, while retaining all features from prior layers. The third row summarizes model development in the All of Us Research Program (N = 39,431) and external validation in the BioMe Biobank (N = 9,818), where the same trained models and risk thresholds were applied without retraining. The bottom row highlights the three evaluation domains: predictive performance, fixed-policy screening, and missed-case recovery context. DKD, diabetic kidney disease; T2D, type 2 diabetes; PRS, polygenic risk scores; AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; PPV, positive predictive value; SHAP, SHapley Additive exPlanations.
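
The fixed screening policy described in this abstract (every model evaluated at the same 90% specificity) can be illustrated with a small sketch: for each model, pick the risk cutoff that leaves at least 90% of non-cases at or below it, then compare sensitivity across models at that shared false positive rate. The function names and the toy scores are hypothetical and chosen only for illustration; the preprint's own thresholding procedure may differ in detail.

```python
import math


def threshold_at_specificity(neg_scores, target_specificity=0.90):
    """Smallest observed cutoff with at least the target share of non-cases at or below it."""
    ordered = sorted(neg_scores)
    # Number of non-cases that must fall at or below the cutoff.
    # round() guards against floating-point noise in the product.
    k = max(1, math.ceil(round(target_specificity * len(ordered), 9)))
    return ordered[k - 1]


def specificity(neg_scores, threshold):
    """Share of non-cases NOT flagged (score at or below the cutoff)."""
    return sum(s <= threshold for s in neg_scores) / len(neg_scores)


def sensitivity(pos_scores, threshold):
    """Share of cases flagged (score strictly above the cutoff)."""
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


# Hypothetical predicted risks from a base model and a richer model.
base_neg = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.55, 0.70]
base_pos = [0.30, 0.45, 0.60, 0.80]
rich_neg = [0.02, 0.04, 0.08, 0.10, 0.15, 0.18, 0.25, 0.30, 0.42, 0.65]
rich_pos = [0.40, 0.50, 0.70, 0.90]

for name, neg, pos in [("base", base_neg, base_pos), ("rich", rich_neg, rich_pos)]:
    t = threshold_at_specificity(neg, 0.90)
    print(name, t, specificity(neg, t), sensitivity(pos, t))
# base 0.55 0.9 0.5
# rich 0.42 0.9 0.75
```

Because specificity is pinned at the same value for both models, any difference in sensitivity reflects additional cases detected at an unchanged false positive burden, which is the comparison the abstract reports (0.27 versus 0.49).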

12
Deep Learning Reveals the Modular Genetic Architecture of Cardiovascular Aging

Choi, R. B.; Croon, P. M.; Perera, S.; Oikonomou, E.; Khera, R.

2026-04-24 cardiovascular medicine 10.64898/2026.04.22.26351478 medRxiv
Top 0.2%
7.1%

Chronological age is a potent determinant of clinical events, but it is conventionally treated as a linear function of time rather than a dynamic process shaped by genetics and tissue-specific senescence. Deep learning models derived from cardiovascular imaging offer an opportunity to quantify biological age across multiple domains and to examine the extent to which these measures capture shared or distinct vulnerabilities. Here, we applied deep learning to estimate biological age from electrocardiograms, cardiac MRI, carotid ultrasound, and retinal imaging, capturing electrical, structural, macrovascular, and microvascular domains in more than 100,000 UK Biobank participants. Genome-wide association and cross-trait heritability analyses showed that cardiovascular aging is not a singular process but a modular phenotype with distinct genetic determinants across modalities. Polygenic risk scores supported these distinct trajectories, showing that different biological age measures capture partly divergent biological processes with corresponding differences in clinical associations. Modality-specific genes also showcased distinct cell-type enrichment patterns. By deconvoluting aging into electrical, structural, macrovascular, and microvascular components, our results demonstrate that AI-derived age metrics capture distinct, disease-specific aging pathways. Ultimately, this modular framework positions deep learning-derived aging models not as holistic measures of health, but as domain-specific biomarkers of cardiovascular vulnerability.

13
The FEES Dysphagia Index: a bias-resilient continuous score that captures expert clinical judgment in 2,943 neurological inpatients

Werner, C. J.; Sanchez-Garcia, E.; Mall, B.; Meyer, T.; Pinho, J.; Schulz, J. B.; Schumann-Werner, B.

2026-04-21 neurology 10.64898/2026.04.20.26351259 medRxiv
Top 0.3%
6.4%

Multi-consistency testing during flexible endoscopic evaluation of swallowing (FEES) is clinically necessary but introduces selection bias: worst scores inflate severity because the number of consistencies tested covaries with disease severity. In this retrospective observational study of hospitalized neurological patients, we derived and validated the FEES Dysphagia Index (FDI) in two temporally independent cohorts (Cohort 1: 2013-2018, N=1,257; Cohort 2: 2021-2025, N=1,686) from a single center. FDI-S averages Penetration-Aspiration Scale (PAS) scores across tested consistencies (0-100 scale); FDI-E uses Yale Pharyngeal Residue scores; FDI-C combines both. Selection bias was quantified using sequential branching-tree inverse probability weighting (IPW). Worst PAS overestimated severity by 24%; FDI deviated by <2%. FDI-C was significantly superior to Worst PAS for hospital-acquired pneumonia (HAP; AUC 0.70 vs. 0.60, p<0.001), mortality (0.71 vs. 0.62, p=0.040), and restricted oral intake (0.90 vs. 0.74, p<0.001), and statistically equivalent to clinician-rated severity. FDI-C mapped linearly onto ordinal Functional Oral Intake Scale values (FOIS; proportional odds RCS p=0.99). With functional status and diagnosis, FDI-C reconstructed the clinician's oral intake recommendation with AUC up to 0.93. The FDI-C-mortality relationship was sigmoidal with a clinically relevant transition zone between ~50 and ~85. FDI-C is a bias-resilient, bedside-calculable score with interval-scale properties that captures expert clinical judgment, suitable as both a clinical decision support tool and a continuous research endpoint.
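
To make the contrast between an averaged score and a worst score concrete, here is a rough sketch. It assumes the PAS ranges from 1 to 8 and that the 0-100 scale is obtained by a simple linear rescaling of the mean; the abstract does not state the exact transformation, so the formula, the function names, and the example values are assumptions for illustration rather than the authors' definition of FDI-S.

```python
PAS_MIN, PAS_MAX = 1, 8  # Penetration-Aspiration Scale range


def rescale_0_100(pas_value):
    """Linear map from the PAS range onto 0-100 (assumed, not taken from the paper)."""
    return (pas_value - PAS_MIN) / (PAS_MAX - PAS_MIN) * 100


def mean_pas_index(pas_scores):
    """Average PAS across the consistencies actually tested, on a 0-100 scale."""
    if not pas_scores:
        raise ValueError("at least one tested consistency is required")
    return rescale_0_100(sum(pas_scores) / len(pas_scores))


def worst_pas_index(pas_scores):
    """Worst (highest) PAS across tested consistencies, on the same 0-100 scale."""
    if not pas_scores:
        raise ValueError("at least one tested consistency is required")
    return rescale_0_100(max(pas_scores))


# Hypothetical patient tested on three consistencies.
scores = [1, 3, 8]
print(round(mean_pas_index(scores), 1))  # 42.9
print(worst_pas_index(scores))           # 100.0
```

A maximum can only stay the same or rise as more consistencies are tested, whereas a mean has no such built-in drift, which is the intuition behind the selection bias the abstract describes.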

14
OpenEvidence errs on the safe side in a structured test of triage recommendations

Jia, E.; Omar, M.; Barash, Y.; Brook, O. R.; Ahmed, M.; Kruskal, J. B.; Gorenshtein, A.; Klang, E.

2026-04-24 health informatics 10.64898/2026.04.23.26351526 medRxiv
Top 0.3%
6.3%

Ramaswamy et al. recently reported in Nature Medicine that ChatGPT Health, a consumer-facing health AI tool, undertriaged 51.6% of true emergencies. It was also susceptible to social anchoring in a structured stress test of triage recommendations. We applied the same vignette-based benchmark to OpenEvidence, a widely used physician-facing AI platform for clinical decision support. The benchmark included 960 prompts across 21 clinical domains (Supplementary Table S3). OpenEvidence undertriaged 12.5% of emergencies, a four-fold reduction relative to ChatGPT Health. It also showed no anchoring effect. Its errors skewed in a safer direction, including 68.0% overtriage of Home presentations. In 65 of 960 responses (6.8%), it declined to assign a triage level. These refusals occurred only in symptom-only prompts and never in urgent or emergency cases. Performance improved when objective clinical data were provided. Under the same benchmark, a widely used physician-facing system showed a different safety profile from a consumer-facing one. This suggests that who a health AI is built for can shape how it fails.

15
Onca: An Open 9B Language Model for Pancreatic Cancer Clinical Tasks

Shim, K. B.

2026-04-24 oncology 10.64898/2026.04.16.26351055 medRxiv
Top 0.4%
6.2%

Pancreatic ductal adenocarcinoma (PDAC) remains one of the deadliest solid tumors and continues to face low treatment-trial participation, fragmented evidence workflows, and labor-intensive abstraction of unstructured clinical text. Existing oncology-focused language models show promise, but many depend on private institutional corpora, limiting reproducibility and practical reuse across centers. We present Onca, an open 9B dense model designed for four PDAC-relevant tasks: trial eligibility screening, case-specific clinical reasoning, structured pathology report extraction, and molecular variant evidence reasoning. Onca is fine-tuned from Qwopus3.5-9B-v3 with a single Unsloth BF16 LoRA adapter on 37,364 training rows drawn from openly available sources. The evaluation spans 11 panels and compares Onca against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the unmodified Qwopus base. Onca achieves the strongest overall results on Trial Screening (81.6 F1), Clinical Reasoning (14.1 composite), Pathology Extraction (30.5 field exact-match), PubMedQA Cancer (68.3 macro-F1), and PubMedQA (66.5 macro-F1). The strongest gains appear in tasks closest to routine oncology workflow, especially trial review and pathology structuring. These findings suggest that clinically targeted pancreatic-cancer language models can be built from open data with competitive performance while remaining practical to train on a single workstation-scale GPU setup.

16
CalPred yields calibrated intervals for polygenic risk prediction

Shi, Z.; Zhang, Z.; Mandla, R.; Hou, K.; Pasaniuc, B.

2026-04-22 genetic and genomic medicine 10.64898/2026.04.21.26351410 medRxiv
Top 0.4%
4.9%

Polygenic scores (PGS) have emerged as a useful biomarker for stratification of high-risk individuals in genomic medicine, with prediction intervals arising as a principled approach to incorporate statistical uncertainty in their individual-level predictions. In contrast to recent reports by Xu et al. [7], we show that CalPred [6] provides well-calibrated prediction intervals that contain the trait phenotypes at targeted confidence levels. CalPred maintains calibration when PGS performance varies across contextual factors (e.g., ancestry, age, sex, or socio-economic factors), whereas PredInterval [7], a recently introduced method that focuses on marginal calibration across all individuals, exhibits miscalibration.
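The distinction between marginal and context-specific calibration can be illustrated with a toy simulation. The sketch below is not CalPred or PredInterval; it assumes two hypothetical groups whose prediction residuals have different spread (standard deviations 0.5 and 2.0 are arbitrary choices) and compares one pooled interval width against per-group widths.

```python
import random

def abs_quantile(residuals, level):
    """Empirical `level` quantile of the absolute residuals."""
    ordered = sorted(abs(r) for r in residuals)
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]

def coverage(residuals, half_width):
    """Fraction of residuals falling inside +/- half_width."""
    return sum(abs(r) <= half_width for r in residuals) / len(residuals)

rng = random.Random(0)  # fixed seed so the toy example is reproducible
level = 0.90

# Hypothetical contexts: the score predicts well in one group, poorly in the other.
low_noise = [rng.gauss(0.0, 0.5) for _ in range(5000)]
high_noise = [rng.gauss(0.0, 2.0) for _ in range(5000)]
pooled = low_noise + high_noise

# Marginal calibration: one interval width chosen over all individuals.
pooled_width = abs_quantile(pooled, level)
marginal_overall = coverage(pooled, pooled_width)
marginal_low = coverage(low_noise, pooled_width)
marginal_high = coverage(high_noise, pooled_width)

# Context-specific calibration: one width per group.
group_low = coverage(low_noise, abs_quantile(low_noise, level))
group_high = coverage(high_noise, abs_quantile(high_noise, level))

print(f"pooled width, overall coverage:    {marginal_overall:.3f}")
print(f"pooled width, low-noise coverage:  {marginal_low:.3f}")
print(f"pooled width, high-noise coverage: {marginal_high:.3f}")
print(f"per-group widths, coverage:        {group_low:.3f}, {group_high:.3f}")
```

In this toy setup the pooled width reaches the target overall while over-covering the low-noise group and under-covering the high-noise group, which is the kind of miscalibration across contexts the abstract describes.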

17
Blood-to-tissue translation in autoimmune disease: paired single-cell evidence from systemic sclerosis

Rajeevan, N.; Khan, Z.

2026-04-21 immunology 10.64898/2026.04.18.719421 medRxiv
Top 0.5%
4.4%

The biology that governs progression and therapeutic response in autoimmune disease is organized in affected tissue, but direct molecular readout of that biology requires invasive biopsy and is rarely repeated during clinical trials or routine care. Using paired blood-skin single-cell RNA-sequencing from a systemic sclerosis (SSc) cohort of 74 individuals (57 patients and 17 matched controls, 192,809 cells across 53 annotated cell states), we show that peripheral blood carries a recoverable projection of tissue-resident molecular state. Across 63 pathways scored in both compartments, 43 same-pathway blood-skin associations reach FDR < 0.05; at cell-type resolution, 212 cross-compartment associations survive residualization for disease status and sex. Per-patient classifiers recover tissue-defined molecular states out of fold with AUCs between 0.62 and 0.79, with the strongest recoveries on fibroblast subtype programs that have no direct circulating analog: fibroblast COMP at 0.79, COCH at 0.75, MYOC2 at 0.74, POSTN at 0.74. Tissue programs route through different blood compartments at different representational levels: fibroblast programs resolve through T-cell, Treg, monocyte and B-cell axes at compositional and distributional levels, while interferon resolves through expression state across multiple cell types. Within SSc alone, a cross-validated partial least squares model learns a shared blood-skin latent axis at r = 0.486 (permutation p = 0.006); the induced patient ranking recovers tissue-interferon-high patients at 86% precision at the top-20% screening threshold against a 50% base rate.
A paired multiview autoencoder, trained on module-level dependency structure under contrastive alignment, paired reconstruction, neighborhood preservation and tissue-target supervision, learns a shared latent geometry in which blood-only projections land in the same tissue-state region as their matched tissue samples and supports recovery of held-out tissue targets above simpler baselines and above two permutation null families. These results map the empirical geometry of cross-compartment inference in autoimmune disease and position peripheral blood as a substrate for tissue-state inference at trial and clinical scale.

18
Generalizing intensive care AI across time scales in resource-limited settings

Devadiga, A.; Singh, P.; Sankar, J.; Lodha, R.; Sethi, T.

2026-04-24 health informatics 10.64898/2026.04.23.26351588 medRxiv
Top 0.5%
4.3%

Temporal resolution of physiological monitoring in intensive care varies widely across healthcare systems. Artificial intelligence models assume a uniform and fixed frequency of sampling, thus limiting the generalizability of models, especially to resource-limited settings. Here, we propose a novel resolution-transfer task for physiological time series and ask whether models trained on high-resolution data can generalize to a low data-density setting without the need to retrain them. SafeICU, a novel longitudinal pediatric intensive care dataset spanning ten years from a tertiary care hospital in India, was used to test this hypothesis. Self-supervised transformer models were trained on 144,271 patient-hours of high-resolution physiological signals from 984 pediatric ICU stays to learn representations of heart rate, respiratory rate, oxygen saturation, and arterial blood pressure. Transfer of this model to low-resolution data established robust performance in clinically relevant lower-frequency intervals, consistently outperforming models trained directly at coarser resolutions. Further, these representations generalized across patient populations, maintaining performance when evaluated on adult intensive care cohorts from the MIMIC-III and eICU databases without retraining. In a downstream task of early shock prediction, models achieved strong discrimination in the pediatric cohort (area under the receiver operating characteristic curve (AUROC) 0.87; area under the precision-recall curve (AUPRC) 0.92) and retained stable performance across monitoring intervals from 10 to 60 minutes (AUROC 0.78-0.88). Together, these results demonstrate that physiological representations learned from high-resolution data enable time-scale-robust and transferable AI for intensive care. 
The publicly released SafeICU dataset, comprising longitudinal vital signs, laboratory measurements, treatment records, microbiology, and admission and discharge, provides a foundation for developing and deploying generalizable clinical AI in resource-limited settings.

19
Biobank-scale survey of gene-diet interactions informs precision nutrition polygenic scores

Di Scipio, M.; Man, A.; Lali, R.; Wu, J.; Le, A.; Franks, P. W.; Pare, G.

2026-04-20 genetic and genomic medicine 10.64898/2026.04.13.26350340 medRxiv
Top 0.8%
3.6%

Genome-guided dietary advice is a goal of precision nutrition. However, the contribution of gene-diet interactions (GxDs) to disease risk remains unclear, hindering the identification of diet-outcome pairs more likely amenable to genetic-based recommendations. We thus implemented a two-step approach: first, we comprehensively assessed the contributions of genome-wide GxDs to cardiometabolic outcomes across a broad array of dietary exposures in UK Biobank participants (N = 141,144 to 325,989). Second, we selected the 20 significant diet-outcome pairs from the 713 pairs tested (p < 7.0 × 10⁻⁵) and derived GxD polygenic scores. In an independent sample, all scores were nominally associated with their corresponding outcomes, with 12 of 20 polygenic scores Bonferroni significant (p < 0.0025). Further analyses revealed GxD polygenic scores were associated with clinical outcomes such as incident gout, suggesting translational potential. Altogether, these results showcase the promise of GxD scores to inform precision nutrition.
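Both significance thresholds quoted in this abstract are consistent with a Bonferroni correction at a family-wise alpha of 0.05. The abstract states this explicitly only for the second step, so applying the same correction to the 713 pairs is an assumption here. A minimal sketch, with a hypothetical helper name:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold under a Bonferroni correction."""
    return alpha / n_tests

ALPHA = 0.05  # assumed family-wise error rate; not stated for step 1 in the abstract

# Step 1: 713 diet-outcome pairs tested.
discovery_threshold = bonferroni_threshold(ALPHA, 713)

# Step 2: 20 GxD polygenic scores tested in the independent sample.
validation_threshold = bonferroni_threshold(ALPHA, 20)

print(f"discovery threshold:  {discovery_threshold:.2e}")
print(f"validation threshold: {validation_threshold:.4f}")
```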

20
Behavioral and psychological symptoms of dementia: insights from a multivariate and network-based brain proteome-wide study

Vattathil, S. M.; Duong, D. M.; Gearing, M.; Seyfried, N. T.; Wilson, R. S.; Bennett, D. A.; Woltjer, R. L.; Wingo, T. S.; Wingo, A. P.

2026-04-24 genetic and genomic medicine 10.64898/2026.04.23.26351110 medRxiv
Top 0.9%
3.5%

Behavioral and psychological symptoms of dementia (BPSD) are common, profoundly troubling to patients and caregivers, and difficult to treat, yet their molecular underpinnings remain poorly understood. Here, we generated the first brain proteomic dataset with BPSD phenotyping, profiling the dorsolateral prefrontal cortex of 376 donors from three cohorts spanning nine BPSD domains assessed in life. Protein associations with BPSD were examined using complementary approaches (domain-specific BPSD, multi-domain BPSD, and latent factor modeling) and integrated via cross-cohort meta-analysis. Four proteins (NMT1, DCAKD, DNPH1, and HIBADH) were associated with anxiety in dementia and five proteins (ABL1, SAP18, PLXND1, CTRB2, and LDHD) with multi-domain BPSD or BPSD latent factors after adjusting for sex, age, and other covariates (FDR < 0.05). Additionally, eight protein co-expression networks were associated with BPSD across cohorts. These results link BPSD to dysregulation of synaptic signaling, protein folding, and humoral immune response, providing a molecular framework for therapeutic discovery.